5.1.2. Data
5.1.2.1. Criteria for choosing datasets
Datasets were chosen from the EBI’s Gene Expression Atlas (GxA). A major benefit of the GxA is that raw data generated with the same sequencing technology are re-analysed using the same data analysis pipeline (iRAP[100] for RNA-Seq). In addition to checking the quality of each data set and running it through a common pipeline, the GxA uses the literature to add biological and technical annotations for each sample.
Data sets from the GxA were chosen based on the following requirements:
- Experiments must measure baseline, rather than differential, gene expression.
- Samples must be sequenced using Next Generation Sequencing, i.e. including RNA-Seq and CAGE, and excluding microarrays.
- Data sets must contain a breadth of tissues and genes. This aids batch correction by allowing the most balanced design possible between batch (experiment) and group (tissue), and gives the good coverage of genes and tissues needed for downstream use. In practice, experiments must include “organism part” as an experimental factor (otherwise tissue would not be recorded) and must have more than 80 assays (samples).
- Samples must not be disease-focused. In practice, excluding cancer datasets was enough to exclude disease-focused datasets.
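To make the idea of a balanced batch-to-group design more concrete, the following is a minimal sketch in base R. The sample sheet, experiment names, and tissue labels are invented for illustration and do not come from the GxA. It cross-tabulates experiment against tissue and lists the tissues that appear in more than one experiment, since these are the tissues that link the batches together.

```r
# Toy sample sheet (hypothetical): one row per sample.
samples <- data.frame(
  experiment = c("exp1", "exp1", "exp1", "exp2", "exp2", "exp3"),
  tissue     = c("liver", "brain", "heart", "liver", "brain", "brain")
)

# Cross-tabulate batch (experiment) against group (tissue).
design <- table(samples$experiment, samples$tissue)
print(design)

# Tissues measured in more than one experiment; these overlap between batches.
shared_tissues <- colnames(design)[colSums(design > 0) > 1]
print(shared_tissues)
```

In this toy design, a tissue measured in only one experiment cannot be separated from the effect of that experiment, which is why breadth of tissues per experiment matters.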
5.1.2.1.1. Next Generation Sequencing
As described in the introduction chapter, there are many ways to measure which proteins are being created. Here, I justify my choices of measures to include in the combined dataset.
5.1.2.1.1.1. Gene expression vs protein abundance
Gene expression levels are not necessarily strongly correlated with protein abundance; this has been found in mice[101], yeast[102], and humans[103]. In humans, Spearman correlations between protein abundance and gene expression levels vary between 0.36 and 0.50 depending on tissue, meaning that they are only weakly or moderately correlated[103]. There are many interacting reasons for this. One is that something may prevent the mRNA from being translated, such as slow codons, temperature, ribosome occupancy, or regulatory RNAs and proteins[104]. In these cases, the DNA is transcribed into mRNA but the protein is never produced, so using gene expression data as a measure of how much protein is produced would overestimate protein abundance. If these factors contributed substantially to the weak correlation, protein abundance data might give better results than mRNA abundance data when predicting how proteins affect human phenotypes. On the other hand, it is equally possible that proteins are produced but not captured by protein abundance techniques: protein half-lives range over orders of magnitude, from seconds to days[104][105], so proteins may degrade before being measured. In this case, gene expression data may be a more reliable measure of protein production than protein abundance. In yeast, protein degradation was shown to contribute more to the protein-mRNA correlation than codon usage and amino acid usage (the two other factors estimated in the study), and more than those two factors combined[106].
In summary, there is no perfect measure of translation. Since gene expression data is more readily available, and protein degradation appears to account for most of the discrepancy between mRNA and protein levels, gene expression data is the best available proxy for translation for the downstream uses discussed here.
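As a small illustration of the kind of correlation discussed above, the sketch below computes a Spearman correlation between mRNA and protein abundance in base R. The six genes and all values are invented for illustration; they are not taken from [103] and are not meant to reproduce its results.

```r
# Toy abundances for six hypothetical genes in one tissue (invented values).
mrna    <- c(geneA = 5, geneB = 12, geneC = 30, geneD = 45, geneE = 80, geneF = 150)
protein <- c(geneA = 10, geneB = 40, geneC = 2, geneD = 90, geneE = 25, geneF = 60)

# Spearman correlation is computed on ranks, so it is not affected by the
# very different scales of the two measurement types.
rho <- cor(mrna, protein, method = "spearman")
print(round(rho, 2))
```

Because only ranks matter, any monotonic rescaling of either measurement (for example a log transform) leaves the result unchanged.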
5.1.2.1.1.2. Gene expression vs Transcript expression
Transcript expression data would likely provide more insight than gene expression data if it were available, since there are probably tissue-specific transcripts that do not correspond to tissue-specific genes, e.g. where different transcripts from the same gene are expressed in different tissues. Transcript expression data, however, is harder to come by, and this approach relies on a wealth of available data. Furthermore, transcript expression data can be straightforwardly converted to gene expression data (by summing over the transcripts of each gene), while converting gene expression to transcript expression is decidedly less accurate. In a comparison between the HPA and FANTOM5 experiments, tissue-specificity measures derived from transcript-expression (CAGE) measurements aggregated at the gene/protein level largely (75-93%) matched those derived from gene-expression measurements[107].
For these reasons, I have taken a gene-centric approach here. It may be important, however, to consider whether a gene has multiple transcripts in downstream analysis, for example, if including tissue-specific gene expression information when predicting the function of a protein-coding SNV (since it may not be in the relevant transcript).
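The conversion from transcript to gene expression by summing over transcripts can be sketched as follows. This is a minimal base R example under stated assumptions: the transcript IDs, gene IDs, counts, and the `tx_to_gene` lookup are all hypothetical placeholders, not real Ensembl identifiers or the mapping used in this work.

```r
# Toy transcript-level counts: rows are transcripts, columns are samples.
transcript_counts <- matrix(
  c(10,  0,
     5, 20,
     0,  7),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("tx1", "tx2", "tx3"), c("liver", "brain"))
)

# Hypothetical lookup from transcript ID to gene ID.
tx_to_gene <- c(tx1 = "geneA", tx2 = "geneA", tx3 = "geneB")

# rowsum() adds together the rows that share a group label,
# giving one row of gene-level counts per gene.
gene_counts <- rowsum(transcript_counts,
                      group = tx_to_gene[rownames(transcript_counts)])
print(gene_counts)
```

In this toy data, tx1 is only seen in liver while geneA as a whole is seen in both tissues, which is the kind of transcript-level tissue specificity that is lost after aggregation.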
5.1.2.1.1.3. Inclusion of CAGE data
CAGE measures transcript expression rather than gene expression, and the transcripts it measures are likely to differ from those measured by RNA-Seq. As mentioned above, however, it is possible to calculate gene expression from transcript expression. It’s also possible to map CAGE transcription start sites to existing transcript IDs that may appear in RNA-Seq data sets. When this is done, the results of CAGE have been observed to be comparable to those of RNA-Seq, so including CAGE data in a combined dataset is reasonable.
“We found that the quantified levels of gene expression are largely comparable across platforms and conclude that CAGE and RNA-seq are complementary technologies that can be used to improve incomplete gene models”[108]
5.1.2.1.2. Excluding disease-focused experiments
The decision to exclude disease-focused experiments was made primarily to reduce the complexity of the analysis and the resulting data set. The data set can now be interpreted as representing gene expression of healthy tissues. This was also a practical choice, since most disease data sets (with the exception of cancer datasets) tended to have a narrow breadth of tissues. For example, experiments investigating heart disease would naturally contain measurements of healthy and non-healthy heart tissues, and not other tissues, so would be difficult to combine with existing data sets due to the “missing” data. This would not have been a problem for cancer experiments; however, cancer is known to be tissue-non-specific[25][2].
5.1.2.2. Method of searching
It would have been preferable to interrogate the GxA for datasets using the ExpressionAtlas R package, or the AtlasExpress API which it is built on. However, it was necessary to do so via the website since searches to the AtlasExpress API cannot differentiate between baseline and differential gene expression.
For reproducibility, this was done by downloading the JSON file used by the GxA experiment browser webservice and choosing experiments with:
- baseline (rather than differential) expression measurements
- Homo sapiens as the species
- RNA-Seq mRNA technology
- organism part as an experimental factor
- more than 80 assays
- no mention of “cancer” in the description
# Load all packages inside a function so that start-up messages can be suppressed together
libraries <- function(){
  library(httr)
  library(jsonlite)
  library(tidyverse)
  library(plotly)
  library(IRdisplay)
  library(ExpressionAtlas)
  library(htmlwidgets)
}
suppressMessages(libraries())
analysis_date <- as.Date("2019-06-01") # YYYY-mm-dd
min_assays <- 80
# Download data
gxa_json <- 'https://www.ebi.ac.uk/gxa/json/experiments'
req <- httr::GET(gxa_json)
httr::stop_for_status(req) # stop early if the download failed
text <- httr::content(req, "text", encoding="UTF-8")
gxa_experiment_info <- as_tibble(jsonlite::fromJSON(text)$experiments)
# All data at time of analysis:
gxa_experiment_info$loadDate <- as.Date(gxa_experiment_info$loadDate, "%d-%m-%Y")
gxa_experiment_info <- gxa_experiment_info %>% filter(loadDate < analysis_date)
funnel_info <- tibble(name="All experiments",
                      num_experiments=nrow(gxa_experiment_info))
# All next-gen sequencing experiments:
gxa_experiment_info <- gxa_experiment_info %>%
  filter(technologyType=='RNA-Seq mRNA')
funnel_info <- funnel_info %>%
  add_row(name="RNA-Seq technology",
          num_experiments=nrow(gxa_experiment_info))
# All baseline expression experiments
gxa_experiment_info <- gxa_experiment_info %>%
  filter(experimentType=='Baseline')
baseline <- nrow(gxa_experiment_info)
funnel_info <- funnel_info %>%
  add_row(name="Baseline experiments",
          num_experiments=nrow(gxa_experiment_info))
# All human experiments
gxa_experiment_info <- gxa_experiment_info %>%
  filter(species=='Homo sapiens')
funnel_info <- funnel_info %>%
  add_row(name="Human",
          num_experiments=nrow(gxa_experiment_info))
# All "organism part" experiments with > 80 assays:
gxa_experiment_info$experimentalFactors <- as.character(gxa_experiment_info$experimentalFactors)
gxa_experiment_info <- gxa_experiment_info %>%
  filter(numberOfAssays>min_assays,
         str_detect(experimentalFactors, 'organism part'))
funnel_info <- funnel_info %>%
  add_row(name="Contains tissues and >80 samples",
          num_experiments=nrow(gxa_experiment_info))
# Excluding cancer experiments:
gxa_experiment_info$experimentDescription <- as.character(gxa_experiment_info$experimentDescription)
gxa_experiment_info <- gxa_experiment_info %>%
  filter(str_detect(experimentDescription,'[cC]ancer', negate=TRUE))
funnel_info <- funnel_info %>%
  add_row(name="Not cancer experiments",
          num_experiments=nrow(gxa_experiment_info))
# Draw a funnel chart of how many experiments remain after each filter
create_fig <- function(){
  fig <- plot_ly()
  fig <- fig %>% add_trace(
    type = "funnel",
    y = as.character(funnel_info$name),
    x = as.integer(funnel_info$num_experiments))
  # Keep the funnel stages in the order in which the filters were applied
  fig <- fig %>% layout(yaxis = list(categoryarray = as.character(funnel_info$name)))
  # orca(fig, 'figs/funnel.png', width=800)
  # IRdisplay::display_png(file='figs/funnel.png')
  # saveWidget(fig, "figs/funnel_interactive.html", selfcontained = FALSE)
  display(fig)
}
suppressWarnings(create_fig())
As funnel-combine-gxa shows, at the time of writing there are over 3000 experiments in the GxA, of which 27 are human baseline RNA-Seq experiments. Of these, 4 offer good coverage of non-disease organism parts.
5.1.2.3. Chosen data sets
The data sets chosen for the combined data set are shown in table-chosen-combine-gxa and described below.
# Note: the shortName values rely on the row order left by the filters above
gxa_experiment_info <- gxa_experiment_info %>%
  add_column(shortName=c("HPA", "FANTOM5", "GTEx", "HDBR")) %>%
  select(shortName, experimentAccession, experimentDescription,
         numberOfAssays, experimentalFactors)
head(gxa_experiment_info)
| shortName | experimentAccession | experimentDescription | numberOfAssays | experimentalFactors |
|---|---|---|---|---|
| <chr> | <chr> | <chr> | <int> | <chr> |
| HPA | E-MTAB-2836 | RNA-seq of coding RNA from tissue samples of 122 human individuals representing 32 different tissues | 200 | organism part |
| FANTOM5 | E-MTAB-3358 | RNA-Seq CAGE (Cap Analysis of Gene Expression) analysis of human tissues in RIKEN FANTOM5 project | 96 | c("developmental stage", "organism part") |
| GTEx | E-MTAB-5214 | RNA-seq from 53 human tissue samples from the Genotype-Tissue Expression (GTEx) Project | 18736 | organism part |
| HDBR | E-MTAB-4840 | RNA-seq of coding RNA: Human Developmental Biology Resource (HDBR) expression resource of prenatal human brain development | 613 | c("developmental stage", "organism part") |
5.1.2.3.1. FANTOM5
The FANTOM5 experiment was described in the previous chapter.
5.1.2.3.2. Human Protein Atlas
The Human Protein Atlas (HPA) project[110][111] aims to map all human proteins in cells (including subcellular locations), tissues and organs. The HPA project’s data is not limited to the gene expression data that can be found in GxA, but that is the only part of the data that is used here. The gene expression data that was used (E-MTAB-2836 in GxA) excludes cell lines and includes tissue samples of 122 individuals and 32 different non-diseased tissue types.
5.1.2.3.3. Genotype Tissue Expression
The Genotype Tissue Expression (GTEx) project[112] was developed specifically to study tissue-specific gene expression in humans and contains gene expression data from over 18,000 samples, including 53 non-diseased tissue types and 550 individuals (ranging in age from 20s to 70s).
5.1.2.3.4. Human Developmental Biology Resource
The Human Developmental Biology Resource (HDBR) Expression data[113] is slightly different from the other data sets in that it contains a much narrower range of sample types. All HDBR samples are human brain samples at different stages of development, ranging from 3 to 20 weeks after conception.
5.1.2.4. Data acquisition
Data was obtained, where possible, via the ExpressionAtlas R package[1], which gives gene expression counts identified by ENSG IDs, metadata (containing pipeline, filtering, mapping and quantification information), and details of experimental design (containing, for example, organism part name, individual demographics, and replicate information, depending on the experiment).
For the FANTOM5 experiment, counts for transcript expression were downloaded directly from the FANTOM website. The downloaded FANTOM5 file has already undergone some quality control by FANTOM: it is limited to peaks which meet a “robust” threshold (>10 read counts and 1 TPM for at least one sample).
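To illustrate what a threshold of this kind does, here is a minimal base R sketch on invented data. It is not FANTOM’s own filtering code: the peak names, sample names, and values are hypothetical, and I assume here that both conditions (more than 10 reads and at least 1 TPM) must hold in the same sample for a peak to count as “robust”.

```r
# Toy CAGE peak tables (invented): raw read counts and TPM for three peaks
# in two samples.
counts <- matrix(c(12, 3,
                    4, 9,
                   25, 0),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("peak1", "peak2", "peak3"), c("s1", "s2")))
tpm <- matrix(c(1.5, 0.4,
                0.6, 1.2,
                0.8, 0.0),
              ncol = 2, byrow = TRUE,
              dimnames = dimnames(counts))

# Assumed reading of the threshold: keep a peak if at least one sample has
# both more than 10 reads and at least 1 TPM.
is_robust <- rowSums(counts > 10 & tpm >= 1) > 0
robust_counts <- counts[is_robust, , drop = FALSE]
print(rownames(robust_counts))
```

Here peak3 has enough reads in one sample but falls below 1 TPM, so under this reading it is dropped along with peak2.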
Page References
- 25
Michael I Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol., 15(12):550, 2014. URL: http://dx.doi.org/10.1186/s13059-014-0550-8.
- 100
Nuno A Fonseca, Robert Petryszak, John Marioni, and Alvis Brazma. iRAP - an integrated RNA-seq analysis pipeline. Preprint, June 2014. URL: http://dx.doi.org/10.1101/005991.
- 101
Björn Schwanhäusser, Dorothea Busse, Na Li, Gunnar Dittmar, Johannes Schuchhardt, Jana Wolf, Wei Chen, and Matthias Selbach. Global quantification of mammalian gene expression control. Nature, 473(7347):337–342, May 2011. URL: http://dx.doi.org/10.1038/nature10098.
- 102
S P Gygi, Y Rochon, B R Franza, and R Aebersold. Correlation between protein and mRNA abundance in yeast. Mol. Cell. Biol., 19(3):1720–1730, March 1999. URL: https://www.ncbi.nlm.nih.gov/pubmed/10022859.
- 103
Idit Kosti, Nishant Jain, Dvir Aran, Atul J Butte, and Marina Sirota. Cross-tissue analysis of gene and protein expression in normal and cancer tissues. Sci. Rep., 6:24799, May 2016. URL: http://dx.doi.org/10.1038/srep24799.
- 104
Tobias Maier, Marc Güell, and Luis Serrano. Correlation of mRNA and protein in complex biological samples. FEBS Lett., 583(24):3966–3973, December 2009. URL: http://dx.doi.org/10.1016/j.febslet.2009.10.036.
- 105
Andreas Beyer, Jens Hollunder, Heinz-Peter Nasheuer, and Thomas Wilhelm. Post-transcriptional expression regulation in the yeast saccharomyces cerevisiae on a genomic scale. Mol. Cell. Proteomics, 3(11):1083–1092, November 2004. URL: http://dx.doi.org/10.1074/mcp.M400099-MCP200.
- 106
Gang Wu, Lei Nie, and Weiwen Zhang. Integrative analyses of posttranscriptional regulation in the yeast saccharomyces cerevisiae using transcriptomic and proteomic data. Curr. Microbiol., 57(1):18–22, July 2008. URL: http://dx.doi.org/10.1007/s00284-008-9145-5.
- 107
Nancy Yiu-Lin Yu, Björn M Hallström, Linn Fagerberg, Fredrik Ponten, Hideya Kawaji, Piero Carninci, Alistair R R Forrest, Fantom Consortium, Yoshihide Hayashizaki, Mathias Uhlén, and Carsten O Daub. Complementing tissue characterization by integrating transcriptome profiling from the human protein atlas and from the FANTOM5 consortium. Nucleic Acids Res., 43(14):6787–6798, August 2015. URL: http://dx.doi.org/10.1093/nar/gkv608.
- 108
Hideya Kawaji, Marina Lizio, Masayoshi Itoh, Mutsumi Kanamori-Katayama, Ai Kaiho, Hiromi Nishiyori-Sueki, Jay W Shin, Miki Kojima-Ishiyama, Mitsuoki Kawano, Mitsuyoshi Murata, Noriko Ninomiya-Fukuda, Sachi Ishikawa-Kato, Sayaka Nagao-Sato, Shohei Noma, Yoshihide Hayashizaki, Alistair R R Forrest, Piero Carninci, and FANTOM Consortium. Comparison of CAGE and RNA-seq transcriptome profiling using clonally amplified and single-molecule next-generation sequencing. Genome Res., 24(4):708–717, April 2014. URL: http://dx.doi.org/10.1101/gr.156232.113.
- 2
Eitan E Winter, Leo Goodstadt, and Chris P Ponting. Elevated rates of protein secretion, evolution, and disease among tissue-specific genes. Genome Res., 14(1):54–61, January 2004. URL: http://dx.doi.org/10.1101/gr.1924004.
- 110
Mathias Uhlen, Per Oksvold, Linn Fagerberg, Emma Lundberg, Kalle Jonasson, Mattias Forsberg, Martin Zwahlen, Caroline Kampf, Kenneth Wester, Sophia Hober, Henrik Wernerus, Lisa Björling, and Fredrik Ponten. Towards a knowledge-based human protein atlas. Nature Biotechnology, 28(12):1248–1250, 2010. URL: http://dx.doi.org/10.1038/nbt1210-1248.
- 111
Mathias Uhlén, Linn Fagerberg, Björn M Hallström, Cecilia Lindskog, Per Oksvold, Adil Mardinoglu, Åsa Sivertsson, Caroline Kampf, Evelina Sjöstedt, Anna Asplund, Ingmarie Olsson, Karolina Edlund, Emma Lundberg, Sanjay Navani, Cristina Al-Khalili Szigyarto, Jacob Odeberg, Dijana Djureinovic, Jenny Ottosson Takanen, Sophia Hober, Tove Alm, Per-Henrik Edqvist, Holger Berling, Hanna Tegel, Jan Mulder, Johan Rockberg, Peter Nilsson, Jochen M Schwenk, Marica Hamsten, Kalle von Feilitzen, Mattias Forsberg, Lukas Persson, Fredric Johansson, Martin Zwahlen, Gunnar von Heijne, Jens Nielsen, and Fredrik Pontén. Proteomics. tissue-based map of the human proteome. Science, 347(6220):1260419, January 2015. URL: http://dx.doi.org/10.1126/science.1260419.
- 112
GTEx Consortium. The Genotype-Tissue expression (GTEx) project. Nat. Genet., 45(6):580–585, June 2013. URL: http://dx.doi.org/10.1038/ng.2653.
- 113
Susan J Lindsay, Yaobo Xu, Steven N Lisgo, Lauren F Harkin, Andrew J Copp, Dianne Gerrelli, Gavin J Clowry, Aysha Talbot, Michael J Keogh, Jonathan Coxhead, Mauro Santibanez-Koref, and Patrick F Chinnery. HDBR expression: a unique resource for global and individual gene expression studies during early human brain development. Front. Neuroanat., 10:86, October 2016. URL: http://dx.doi.org/10.3389/fnana.2016.00086.
- 1
M Keays. ExpressionAtlas: download datasets from EMBL-EBI expression atlas. R package version 1.10.0. 2018. URL: https://bioconductor.org/packages/release/bioc/html/ExpressionAtlas.html.